1 基本介紹

1.1 什麼是 maf 格式 (Mutation Annotation Format)?

MAF(Mutation Annotation Format)是一種廣泛使用的突變註解檔格式,常見於 TCGA、ICGC 與各大癌症基因體研究專案。
每一列代表一個 somatic variant,並包含以下主要欄位:

  • Hugo_Symbol:基因名稱
  • Entrez_Gene_Id:Entrez 基因 ID
  • Chromosome / Start_Position / End_Position:變異所在的基因組座標
  • Variant_Classification:變異功能類型(Missense, Nonsense, Splice_Site…)
  • Variant_Type:SNP / INS / DEL
  • Tumor_Seq_Allele1 / Tumor_Seq_Allele2:腫瘤樣本的等位基因
  • Tumor_Sample_Barcode:樣本 ID
  • Protein_Change:蛋白層級變異(如 p.R1517H)
  • i_TumorVAF_WU:突變的 variant allele frequency

1.2 如何產生 maf

  • 原檔為 VCF 可以使用 vcf2maf 將 vcf 轉 maf
  • GATK的 funcotator
  • ANNOVAR 註釋後,可使用 maftools 中的 annovarToMaf 函數,將表格形式的 ANNOVAR 輸出轉換為 MAF。

1.3 什麼是 maftools?

maftools 是一個用於分析並視覺化 MAF 檔案的套件。

1.4 參考資料

以下指令是根據官方文檔範例撰寫,並新增個人的筆記內容,所有相關著作權屬於原作者(如下):


Mayakonda A, Lin DC, Assenov Y, Plass C, Koeffler HP. 2018. Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Research. PMID: 30341162


2 下載 maftools 套件

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

if (!requireNamespace("maftools", quietly = TRUE)) {
    BiocManager::install("maftools", update = FALSE, ask = FALSE)
}

# create `results/` folder for output data
if (!dir.exists("results")) dir.create("results", recursive = TRUE)

3 讀取 maf 檔

使用 read.maf 函數

library(maftools)

#path to TCGA LAML MAF file
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools') 
#clinical information containing survival information and histology. This is optional
laml.clin = system.file('extdata', 'tcga_laml_annot.tsv', package = 'maftools') 

laml = read.maf(maf = laml.maf, clinicalData = laml.clin)
## -Reading
## -Validating
## -Silent variants: 475 
## -Summarizing
## -Processing clinical data
## -Finished in 0.155s elapsed (0.490s cpu)

3.1 查看 MAF object

#Typing laml shows basic summary of MAF file.
laml
## An object of class  MAF 
##                    ID          summary  Mean Median
##                <char>           <char> <num>  <num>
##  1:        NCBI_Build               37    NA     NA
##  2:            Center genome.wustl.edu    NA     NA
##  3:           Samples              193    NA     NA
##  4:            nGenes             1241    NA     NA
##  5:   Frame_Shift_Del               52 0.269      0
##  6:   Frame_Shift_Ins               91 0.472      0
##  7:      In_Frame_Del               10 0.052      0
##  8:      In_Frame_Ins               42 0.218      0
##  9: Missense_Mutation             1342 6.953      7
## 10: Nonsense_Mutation              103 0.534      0
## 11:       Splice_Site               92 0.477      0
## 12:             total             1732 8.974      9

3.2 匯出 maf summary 檔案

#Shows sample summry.
getSampleSummary(laml)
##      Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
##                    <fctr>           <int>           <int>        <int>
##   1:         TCGA-AB-3009               0               5            0
##   2:         TCGA-AB-2807               1               0            1
##   3:         TCGA-AB-2959               0               0            0
##   4:         TCGA-AB-3002               0               0            0
##   5:         TCGA-AB-2849               0               1            0
##  ---                                                                  
## 189:         TCGA-AB-2942               0               0            0
## 190:         TCGA-AB-2946               0               0            0
## 191:         TCGA-AB-2954               0               0            0
## 192:         TCGA-AB-2982               0               0            0
## 193:         TCGA-AB-2903               0               0            0
##      In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
##             <int>             <int>             <int>       <int> <num>
##   1:            1                25                 2           1    34
##   2:            0                16                 3           4    25
##   3:            0                22                 0           1    23
##   4:            0                15                 1           5    21
##   5:            0                16                 1           2    20
##  ---                                                                   
## 189:            1                 0                 0           0     1
## 190:            0                 1                 0           0     1
## 191:            0                 1                 0           0     1
## 192:            0                 1                 0           0     1
## 193:            0                 0                 0           0     0
#Shows gene summary.
getGeneSummary(laml)
##       Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
##            <char>           <int>           <int>        <int>        <int>
##    1:        FLT3               0               0            1           33
##    2:      DNMT3A               4               0            0            0
##    3:        NPM1               0              33            0            0
##    4:        IDH2               0               0            0            0
##    5:        IDH1               0               0            0            0
##   ---                                                                      
## 1237:      ZNF689               0               0            0            0
## 1238:      ZNF75D               0               0            0            0
## 1239:      ZNF827               1               0            0            0
## 1240:       ZNF99               0               0            0            0
## 1241:        ZPBP               0               0            0            0
##       Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
##                   <int>             <int>       <int> <num>          <int>
##    1:                15                 0           3    52             52
##    2:                39                 5           6    54             48
##    3:                 1                 0           0    34             33
##    4:                20                 0           0    20             20
##    5:                18                 0           0    18             18
##   ---                                                                     
## 1237:                 1                 0           0     1              1
## 1238:                 1                 0           0     1              1
## 1239:                 0                 0           0     1              1
## 1240:                 1                 0           0     1              1
## 1241:                 1                 0           0     1              1
##       AlteredSamples
##                <int>
##    1:             52
##    2:             48
##    3:             33
##    4:             20
##    5:             18
##   ---               
## 1237:              1
## 1238:              1
## 1239:              1
## 1240:              1
## 1241:              1
#shows clinical data associated with samples
getClinicalData(laml)
##      Tumor_Sample_Barcode FAB_classification days_to_last_followup
##                    <char>             <char>                 <num>
##   1:         TCGA-AB-2802                 M4                   365
##   2:         TCGA-AB-2803                 M3                   792
##   3:         TCGA-AB-2804                 M3                  2557
##   4:         TCGA-AB-2805                 M0                   577
##   5:         TCGA-AB-2806                 M1                   945
##  ---                                                              
## 189:         TCGA-AB-3007                 M3                  1581
## 190:         TCGA-AB-3008                 M1                   822
## 191:         TCGA-AB-3009                 M4                   577
## 192:         TCGA-AB-3011                 M1                  1885
## 193:         TCGA-AB-3012                 M3                  1887
##      Overall_Survival_Status
##                        <int>
##   1:                       1
##   2:                       1
##   3:                       0
##   4:                       1
##   5:                       1
##  ---                        
## 189:                       0
## 190:                       1
## 191:                       1
## 192:                       0
## 193:                       0
#Shows all fields in MAF
getFields(laml)
##  [1] "Hugo_Symbol"            "Entrez_Gene_Id"         "Center"                
##  [4] "NCBI_Build"             "Chromosome"             "Start_Position"        
##  [7] "End_Position"           "Strand"                 "Variant_Classification"
## [10] "Variant_Type"           "Reference_Allele"       "Tumor_Seq_Allele1"     
## [13] "Tumor_Seq_Allele2"      "Tumor_Sample_Barcode"   "Protein_Change"        
## [16] "i_TumorVAF_WU"          "i_transcript_name"
#Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = 'results/laml')

4 視覺化

4.1 summary plot

plotmafSummary 可以繪製整體資料的摘要,用於快速釐清資料

plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

4.2 bar plot

mafbarplot 用於繪製選定基因變異種類之比率

mafbarplot(maf = laml, genes = c("DNMT3A", "IDH1", "IDH2"))

4.3 oncoplot plot

oncoplot 繪製樣本中的變異基因與變異種類 (Multi_Hit 是指在同一樣本中發生多次突變的基因)

#oncoplot for top ten mutated genes.
oncoplot(maf = laml, top = 10)

4.4 Transition and Transversions

在突變類型中,單一鹼基替換(Single Nucleotide Variant, SNV)可依照其替換的化學性質分為兩大類:

1. Transition (Ti) Purine ↔︎ Purine 或 Pyrimidine ↔︎ Pyrimidine 的替換,包括: - A ↔︎ G(兩者皆為 purine) - C ↔︎ T(兩者皆為 pyrimidine)

Transition 牽涉的化學變化較小,因此通常比 transversion 更常見。

2. Transversion (Tv) Purine ↔︎ Pyrimidine 的替換,包括: - A ↔︎ C - A ↔︎ T - G ↔︎ C - G ↔︎ T

由於涉及較大的化學結構變化,發生頻率通常較低。

maftools 提供 titv() 函式來計算 Transition 與 Transversion 的數量與比例,
並可利用 plotTiTv() 將結果視覺化。

laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
#plot titv summary
plotTiTv(res = laml.titv)

4.5 胺基酸變化的 Lollipop plots

#lollipop plot for DNMT3A, which is one of the most frequent mutated gene in Leukemia.
lollipopPlot(
  maf = laml,
  gene = 'DNMT3A',
  AACol = 'Protein_Change',
  showMutationRate = TRUE,
  labelPos = 882
)
## 3 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
##      HGNC refseq.ID protein.ID aa.length
##    <char>    <char>     <char>     <num>
## 1: DNMT3A NM_022552  NP_072046       912
## 2: DNMT3A NM_153759  NP_715640       723
## 3: DNMT3A NM_175629  NP_783328       912
## Using longer transcript NM_022552 for now.
## Removed 3 mutations for which AA position was not available

4.5.1 自訂資料作為輸入

資料應包含變異位置(pos)與變異數量(count)兩個欄位

#example data
my_data = data.frame(pos = sample.int(912, 15, replace = TRUE), count = sample.int(30, 15, replace = TRUE))
head(my_data)
##   pos count
## 1 664     9
## 2 337    27
## 3 153    22
## 4 671    14
## 5 205     6
## 6 443    12
lollipopPlot(data = my_data, gene = "DNMT3A")
## 3 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
##      HGNC refseq.ID protein.ID aa.length
##    <char>    <char>     <char>     <num>
## 1: DNMT3A NM_022552  NP_072046       912
## 2: DNMT3A NM_153759  NP_715640       723
## 3: DNMT3A NM_175629  NP_783328       912
## Using longer transcript NM_022552 for now.

plotProtein(gene = "TP53", refSeqID = "NM_000546")

4.6 Rainfall plots

rainfallPlot 繪製 Rainfall plots 觀察 mutation hot spot - Kataegis 定義為包含六個或更多連續突變的基因組片段,平均突變間距離小於或等於1,00 bp

brca <- system.file("extdata", "brca.maf.gz", package = "maftools")
brca = read.maf(maf = brca, verbose = FALSE)

rainfallPlot(maf = brca, detectChangePoints = TRUE, pointSize = 0.4)
## Processing TCGA-A8-A08B..
## Kataegis detected at:
##    Chromosome Start_Position End_Position nMuts Avg_intermutation_dist  Size
##         <num>          <num>        <num> <int>                  <num> <num>
## 1:          8       98129348     98133560     7               702.0000  4212
## 2:          8       98398549     98403536     9               623.3750  4987
## 3:          8       98453076     98456466     9               423.7500  3390
## 4:          8      124090377    124096810    22               306.3333  6433
## 5:         12       97436055     97439705     7               608.3333  3650
## 6:         17       29332072     29336153     8               583.0000  4081
##    Tumor_Sample_Barcode   C>G   C>T
##                  <fctr> <int> <int>
## 1:         TCGA-A8-A08B     4     3
## 2:         TCGA-A8-A08B     1     8
## 3:         TCGA-A8-A08B     1     8
## 4:         TCGA-A8-A08B     1    21
## 5:         TCGA-A8-A08B     4     3
## 6:         TCGA-A8-A08B     4     4

file.rename(list.files(pattern="\\.tsv$"), file.path("results", list.files(pattern="\\.tsv$")))
## [1] TRUE

4.7 與 33個 TCGA cohorts 比較 mutation load

laml.mutload = tcgaCompare(maf = laml, cohortName = 'Example-LAML', logscale = TRUE, capture_size = 50)
## Warning in FUN(X[[i]], ...): Removed 1 samples with zero mutations.
## Capture size [TCGA]:  35.8
## Capture size [Input]: 50
## Performing pairwise t-test for differences in mutation burden (per MB)..

4.8 VAF plot

plotVaf 繪製 Variant Allele Frequencies,快速估計突變最嚴重的基因的克隆狀態

plotVaf(maf = laml, vafCol = 'i_TumorVAF_WU')

5 處理 copy-number 資料

5.1 讀取 GISTIC 的輸出檔

gistic_res_folder <- system.file("extdata", package = "maftools")
laml.gistic = readGistic(gisticDir = gistic_res_folder, isTCGA = TRUE)
## -Processing Gistic files..
## --Processing amp_genes.conf_99.txt
## --Processing del_genes.conf_99.txt
## --Processing scores.gistic
## --Summarizing by samples
#GISTIC object
laml.gistic
## An object of class  GISTIC 
##           ID summary
##       <char>   <num>
## 1:   Samples     191
## 2:    nGenes    2622
## 3: cytoBands      16
## 4:       Amp     388
## 5:       Del   26481
## 6:     total   26869

5.2 視覺化 GISTIC 結果

5.2.1 繪製 GISTIC Chrom Plot

gisticChromPlot(gistic = laml.gistic, markBands = "all")

# Co-gisticChromPlot
coGisticChromPlot(gistic1 = laml.gistic, gistic2 = laml.gistic, g1Name = "AML-1", g2Name = "AML-2", type = 'Amp')

5.2.2 繪製 Bubble Plot

gisticBubblePlot(gistic = laml.gistic)

5.2.3 繪製 Onco Plot

gisticOncoPlot(gistic = laml.gistic, clinicalData = getClinicalData(x = laml), clinicalFeatures = 'FAB_classification', sortByAnnotation = TRUE, top = 10)

5.3 CBS segments

5.3.1 總結 chromosomal arm level CN

laml.seg <- system.file("extdata", "LAML_CBS_segments.tsv.gz", package = "maftools")
segSummarize_results = segSummarize(seg = laml.seg)

## Recurrent chromosomal arm aberrations
##        arm Variant_Classification     N
##     <char>                 <char> <int>
##  1:    5_q                   Loss     9
##  2:   16_q                   Loss     2
##  3:   21_q                   Gain     2
##  4:    7_q                   Loss     2
##  5:    1_p                   Gain     1
##  6:    1_q                   Gain     1
##  7:   11_p                   Loss     1
##  8:   11_q                   Gain     1
##  9:   12_p                   Loss     1
## 10:   18_p                   Loss     1
## 11:   19_q                   Gain     1
## 12:   19_p                   Gain     1
## 13:   20_q                   Loss     1
## 14:    3_p                   Loss     1
## 15:    7_p                   Loss     1
## 16:    8_q                   Gain     1
## 17:    9_q                   Loss     1

5.3.2 視覺化 CBS segments

tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")
plotCBSsegments(cbsFile = tcga.ab.009.seg)
## No 'tsb' specified, plot head 1 sample. Set tsb='ALL' to plot all samples.

6 綜合分析

6.1 體細胞相互作用

#exclusive/co-occurance event analysis on top 10 mutated genes. 
somaticInteractions(maf = laml, top = 25, pvalue = c(0.05, 0.1))

##       gene1  gene2       pValue oddsRatio    00    01    11    10        pAdj
##      <char> <char>        <num>     <num> <int> <int> <int> <int>       <num>
##   1:  ASXL1  RUNX1 0.0001541586 55.215541   176    12     4     1 0.003568486
##   2:   IDH2  RUNX1 0.0002809928  9.590877   164     9     7    13 0.006055880
##   3:   IDH2  ASXL1 0.0004030636 41.077327   172     1     4    16 0.008126283
##   4:   FLT3   NPM1 0.0009929836  3.763161   125    16    17    35 0.018664260
##   5:   SMC3 DNMT3A 0.0010451985 20.177713   144    42     6     1 0.018664260
##  ---                                                                         
## 296:  PLCE1  ASXL1 1.0000000000  0.000000   184     5     0     4 1.000000000
## 297:  RAD21  FAM5C 1.0000000000  0.000000   183     5     0     5 1.000000000
## 298:  PLCE1  FAM5C 1.0000000000  0.000000   184     5     0     4 1.000000000
## 299:  PLCE1  RAD21 1.0000000000  0.000000   184     5     0     4 1.000000000
## 300:   EZH2  PLCE1 1.0000000000  0.000000   186     4     0     3 1.000000000
##                   Event         pair event_ratio
##                  <char>       <char>      <char>
##   1:       Co_Occurence ASXL1, RUNX1        4/13
##   2:       Co_Occurence  IDH2, RUNX1        7/22
##   3:       Co_Occurence  ASXL1, IDH2        4/17
##   4:       Co_Occurence   FLT3, NPM1       17/51
##   5:       Co_Occurence DNMT3A, SMC3        6/43
##  ---                                            
## 296: Mutually_Exclusive ASXL1, PLCE1         0/9
## 297: Mutually_Exclusive FAM5C, RAD21        0/10
## 298: Mutually_Exclusive FAM5C, PLCE1         0/9
## 299: Mutually_Exclusive PLCE1, RAD21         0/9
## 300: Mutually_Exclusive  EZH2, PLCE1         0/7

6.2 使用 cluster 偵測 cancer driver genes

laml.sig = oncodrive(maf = laml, AACol = 'Protein_Change', minMut = 5, pvalMethod = 'zscore')
## Warning in oncodrive(maf = laml, AACol = "Protein_Change", minMut = 5,
## pvalMethod = "zscore"): Oncodrive has been superseeded by OncodriveCLUSTL. See
## http://bg.upf.edu/group/projects/oncodrive-clust.php
## Estimating background scores from synonymous variants..
## Not enough genes to build background. Using predefined values. (Mean = 0.279; SD = 0.13)
## Estimating cluster scores from non-syn variants..
##   |                                                                              |                                                                      |   0%  |                                                                              |===                                                                   |   4%  |                                                                              |======                                                                |   9%  |                                                                              |=========                                                             |  13%  |                                                                              |============                                                          |  17%  |                                                                              |===============                                                       |  22%  |                                                                              |==================                                                    |  26%  |                                                                              |=====================                                                 |  30%  |                                                                              |========================                                              |  35%  |                                                                              |===========================                                           |  39%  |                                                                              |==============================                                        |  43%  |                                                                              |=================================                                     |  48%  |                                                                              |=====================================                                 |  52%  |                                                                              |========================================                              |  57%  |                                                                              |===========================================                           |  61%  |                                                                              |==============================================                        |  65%  |                                                                              |=================================================                     |  70%  |                                                                              |====================================================                  |  74%  |                                                                              |=======================================================               |  78%  |                                                                              |==========================================================            |  83%  |                                                                              |=============================================================         |  87%  |                                                                              |================================================================      |  91%  |                                                                              |===================================================================   |  96%  |                                                                              |======================================================================| 100%
## Comapring with background model and estimating p-values..
## Done !
head(laml.sig)
##    Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
##         <char>           <int>           <int>        <int>        <int>
## 1:        IDH1               0               0            0            0
## 2:        IDH2               0               0            0            0
## 3:        NPM1               0              33            0            0
## 4:        NRAS               0               0            0            0
## 5:       U2AF1               0               0            0            0
## 6:         KIT               1               1            0            1
##    Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
##                <int>             <int>       <int> <num>          <int>
## 1:                18                 0           0    18             18
## 2:                20                 0           0    20             20
## 3:                 1                 0           0    34             33
## 4:                15                 0           0    15             15
## 5:                 8                 0           0     8              8
## 6:                 7                 0           0    10              8
##    AlteredSamples clusters muts_in_clusters clusterScores protLen   zscore
##             <int>    <int>            <int>         <num>   <int>    <num>
## 1:             18        1               18     1.0000000     414 5.546154
## 2:             20        2               20     1.0000000     452 5.546154
## 3:             33        2               32     0.9411765     294 5.093665
## 4:             15        2               15     0.9218951     189 4.945347
## 5:              8        1                7     0.8750000     240 4.584615
## 6:              8        2                9     0.8500000     976 4.392308
##            pval          fdr fract_muts_in_clusters
##           <num>        <num>                  <num>
## 1: 1.460110e-08 1.022077e-07              1.0000000
## 2: 1.460110e-08 1.022077e-07              1.0000000
## 3: 1.756034e-07 8.194826e-07              0.9411765
## 4: 3.800413e-07 1.330144e-06              1.0000000
## 5: 2.274114e-06 6.367520e-06              0.8750000
## 6: 5.607691e-06 1.308461e-05              0.9000000
plotOncodrive(res = laml.sig, fdrCutOff = 0.1, useFraction = TRUE, labelSize = 0.5)

6.3 新增 pfam Domains

laml.pfam = pfamDomains(maf = laml, AACol = 'Protein_Change', top = 10)
## Warning in pfamDomains(maf = laml, AACol = "Protein_Change", top = 10): Removed
## 50 mutations for which AA position was not available

#Protein summary (Printing first 7 columns for display convenience)
laml.pfam$proteinSummary[,1:7, with = FALSE]
##         HGNC AAPos Variant_Classification     N total  fraction   DomainLabel
##       <char> <num>                 <fctr> <int> <num>     <num>        <char>
##    1: DNMT3A   882      Missense_Mutation    27    54 0.5000000 AdoMet_MTases
##    2:   IDH1   132      Missense_Mutation    18    18 1.0000000      PTZ00435
##    3:   IDH2   140      Missense_Mutation    17    20 0.8500000      PTZ00435
##    4:   FLT3   835      Missense_Mutation    14    52 0.2692308      PKc_like
##    5:   FLT3   599           In_Frame_Ins    10    52 0.1923077      PKc_like
##   ---                                                                        
## 1512: ZNF646   875      Missense_Mutation     1     1 1.0000000          <NA>
## 1513: ZNF687   554      Missense_Mutation     1     2 0.5000000          <NA>
## 1514: ZNF687   363      Missense_Mutation     1     2 0.5000000          <NA>
## 1515: ZNF75D     5      Missense_Mutation     1     1 1.0000000          <NA>
## 1516: ZNF827   427        Frame_Shift_Del     1     1 1.0000000          <NA>
#Domain summary (Printing first 3 columns for display convenience)
laml.pfam$domainSummary[,1:3, with = FALSE]
##        DomainLabel nMuts nGenes
##             <char> <int>  <int>
##   1:      PKc_like    55      5
##   2:      PTZ00435    38      2
##   3: AdoMet_MTases    33      1
##   4:         7tm_1    24     24
##   5:       COG5048    17     17
##  ---                           
## 499:    ribokinase     1      1
## 500:   rim_protein     1      1
## 501: sigpep_I_bact     1      1
## 502:           trp     1      1
## 503:        zf-BED     1      1

6.4 Survival 分析

6.4.1 單一基因

#Survival analysis based on grouping of DNMT3A mutation status
mafSurvival(maf = laml, genes = 'DNMT3A', time = 'days_to_last_followup', Status = 'Overall_Survival_Status', isTCGA = TRUE)
## Looking for clinical data in annoatation slot of MAF..
## Number of mutated samples for given genes:
## DNMT3A 
##     48
## Removed 11 samples with NA's
## Median survival..
##     Group medianTime     N
##    <char>      <num> <int>
## 1: Mutant        245    45
## 2:     WT        396   137

6.4.2 找出導致存活率低的基因集

#Using top 20 mutated genes to identify a set of genes (of size 2) to predict poor prognostic groups
prog_geneset = survGroup(maf = laml, top = 20, geneSetSize = 2, time = "days_to_last_followup", Status = "Overall_Survival_Status", verbose = FALSE)
## Removed 11 samples with NA's
print(prog_geneset)
##     Gene_combination P_value    hr    WT Mutant
##               <char>   <num> <num> <int>  <int>
##  1:      FLT3_DNMT3A 0.00104 2.510   164     18
##  2:      DNMT3A_SMC3 0.04880 2.220   176      6
##  3:      DNMT3A_NPM1 0.07190 1.720   166     16
##  4:      DNMT3A_TET2 0.19600 1.780   176      6
##  5:        FLT3_TET2 0.20700 1.860   177      5
##  6:        NPM1_IDH1 0.21900 0.495   176      6
##  7:      DNMT3A_IDH1 0.29300 1.500   173      9
##  8:       IDH2_RUNX1 0.31800 1.580   176      6
##  9:        FLT3_NPM1 0.53600 1.210   165     17
## 10:      DNMT3A_IDH2 0.68000 0.747   178      4
## 11:      DNMT3A_NRAS 0.99200 0.986   178      4
mafSurvGroup(maf = laml, geneSet = c("DNMT3A", "FLT3"), time = "days_to_last_followup", Status = "Overall_Survival_Status")
## Looking for clinical data in annoatation slot of MAF..
## Removed 11 samples with NA's
## Median survival..
##     Group medianTime     N
##    <char>      <num> <int>
## 1: Mutant      242.5    18
## 2:     WT      379.5   164

6.5 比較兩個 cohort

#Primary APL MAF
primary.apl = system.file("extdata", "APL_primary.maf.gz", package = "maftools")
primary.apl = read.maf(maf = primary.apl)
## -Reading
## -Validating
## --Non MAF specific values in Variant_Classification column:
##   ITD
## -Silent variants: 45 
## -Summarizing
## -Processing clinical data
## --Missing clinical data
## -Finished in 0.054s elapsed (0.199s cpu)
#Relapse APL MAF
relapse.apl = system.file("extdata", "APL_relapse.maf.gz", package = "maftools")
relapse.apl = read.maf(maf = relapse.apl)
## -Reading
## -Validating
## --Non MAF specific values in Variant_Classification column:
##   ITD
## -Silent variants: 19 
## -Summarizing
## -Processing clinical data
## --Missing clinical data
## -Finished in 0.055s elapsed (0.189s cpu)
#Considering only genes which are mutated in at-least in 5 samples in one of the cohort to avoid bias due to genes mutated in single sample.
pt.vs.rt <- mafCompare(m1 = primary.apl, m2 = relapse.apl, m1Name = 'Primary', m2Name = 'Relapse', minMut = 5)
print(pt.vs.rt)
## $results
##    Hugo_Symbol Primary Relapse         pval         or       ci.up      ci.low
##         <char>   <num>   <int>        <num>      <num>       <num>       <num>
## 1:         PML       1      11 1.529935e-05 0.03537381   0.2552937 0.000806034
## 2:        RARA       0       7 2.574810e-04 0.00000000   0.3006159 0.000000000
## 3:       RUNX1       1       5 1.310500e-02 0.08740567   0.8076265 0.001813280
## 4:        FLT3      26       4 1.812779e-02 3.56086275  14.7701728 1.149009169
## 5:      ARID1B       5       8 2.758396e-02 0.26480490   0.9698686 0.064804160
## 6:         WT1      20      14 2.229087e-01 0.60619329   1.4223101 0.263440988
## 7:        KRAS       6       1 4.334067e-01 2.88486293 135.5393108 0.337679367
## 8:        NRAS      15       4 4.353567e-01 1.85209500   8.0373994 0.553883512
## 9:      ARID1A       7       4 7.457274e-01 0.80869223   3.9297309 0.195710173
##         adjPval
##           <num>
## 1: 0.0001376942
## 2: 0.0011586643
## 3: 0.0393149868
## 4: 0.0407875250
## 5: 0.0496511201
## 6: 0.3343630535
## 7: 0.4897762916
## 8: 0.4897762916
## 9: 0.7457273717
## 
## $SampleSummary
##     Cohort SampleSize
##     <char>      <num>
## 1: Primary        124
## 2: Relapse         58

6.5.1 森林圖

forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.1)

6.5.2 並排繪製兩個腫瘤圖

genes = c("PML", "RARA", "RUNX1", "ARID1B", "FLT3")
coOncoplot(m1 = primary.apl, m2 = relapse.apl, m1Name = 'PrimaryAPL', m2Name = 'RelapseAPL', genes = genes, removeNonMutated = TRUE)

6.5.3 並排繪製兩個長條圖

coBarplot(m1 = primary.apl, m2 = relapse.apl, m1Name = "Primary", m2Name = "Relapse")

6.5.4 合併兩個lollipop Plot

lollipopPlot2(m1 = primary.apl, m2 = relapse.apl, gene = "PML", AACol1 = "amino_acid_change", AACol2 = "amino_acid_change", m1_name = "Primary", m2_name = "Relapse")
## Gene: PML
## 9 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
##      HGNC refseq.ID protein.ID aa.length
##    <char>    <char>     <char>     <num>
## 1:    PML NM_002675  NP_002666       633
## 2:    PML NM_033238  NP_150241       882
## 3:    PML NM_033239  NP_150242       829
## 4:    PML NM_033240  NP_150243       611
## 5:    PML NM_033244  NP_150247       560
## 6:    PML NM_033246  NP_150249       423
## 7:    PML NM_033247  NP_150250       435
## 8:    PML NM_033249  NP_150252       585
## 9:    PML NM_033250  NP_150253       781
## Using longer transcript NM_033238 for now.
## 9 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
##      HGNC refseq.ID protein.ID aa.length
##    <char>    <char>     <char>     <num>
## 1:    PML NM_002675  NP_002666       633
## 2:    PML NM_033238  NP_150241       882
## 3:    PML NM_033239  NP_150242       829
## 4:    PML NM_033240  NP_150243       611
## 5:    PML NM_033244  NP_150247       560
## 6:    PML NM_033246  NP_150249       423
## 7:    PML NM_033247  NP_150250       435
## 8:    PML NM_033249  NP_150252       585
## 9:    PML NM_033250  NP_150253       781
## Using longer transcript NM_033238 for now.

6.6 臨床富集分析

fab.ce = clinicalEnrichment(maf = laml, clinicalFeature = 'FAB_classification')
## Sample size per factor in FAB_classification:
## 
## M0 M1 M2 M3 M4 M5 M6 M7 
## 19 44 44 21 39 19  3  3
#Results are returned as a list. Significant associations p-value < 0.05
fab.ce$groupwise_comparision[p_value < 0.05]
##    Hugo_Symbol Group1 Group2 n_mutated_group1 n_mutated_group2      p_value
##         <char> <char> <char>           <char>           <char>        <num>
## 1:        IDH1     M1   Rest         11 of 44         7 of 149 0.0002597371
## 2:        TP53     M7   Rest           3 of 3        12 of 190 0.0003857187
## 3:      DNMT3A     M5   Rest         10 of 19        38 of 174 0.0089427384
## 4:       CEBPA     M2   Rest          7 of 44         6 of 149 0.0117352110
## 5:       RUNX1     M0   Rest          5 of 19        11 of 174 0.0117436825
## 6:        NPM1     M5   Rest          7 of 19        26 of 174 0.0248582372
## 7:        NPM1     M3   Rest          0 of 21        33 of 172 0.0278630823
## 8:      DNMT3A     M3   Rest          1 of 21        47 of 172 0.0294005111
##          OR      OR_low    OR_high       fdr
##       <num>       <num>      <num>     <num>
## 1: 6.670592 2.173829026 21.9607250 0.0308575
## 2:      Inf 5.355415451        Inf 0.0308575
## 3: 3.941207 1.333635173 11.8455979 0.3757978
## 4: 4.463237 1.204699322 17.1341278 0.3757978
## 5: 5.216902 1.243812880 19.4051505 0.3757978
## 6: 3.293201 1.001404899 10.1210509 0.5880102
## 7: 0.000000 0.000000000  0.8651972 0.5880102
## 8: 0.133827 0.003146708  0.8848897 0.5880102
plotEnrichmentResults(enrich_res = fab.ce, pVal = 0.05, geneFontSize = 0.5, annoFontSize = 0.6)

6.7 藥物-基因交互作用

dgi = drugInteractions(maf = laml, fontSize = 0.75)

dnmt3a.dgi = drugInteractions(genes = "DNMT3A", drugs = TRUE)
## Number of claimed drugs for given genes:
##      Gene     N
##    <char> <int>
## 1: DNMT3A     7
#Printing selected columns.
dnmt3a.dgi[,.(Gene, interaction_types, drug_name, drug_claim_name)]
##      Gene interaction_types    drug_name drug_claim_name
##    <char>            <char>       <char>          <char>
## 1: DNMT3A                                            N/A
## 2: DNMT3A                   DAUNORUBICIN    Daunorubicin
## 3: DNMT3A                     DECITABINE      Decitabine
## 4: DNMT3A                     IDARUBICIN      IDARUBICIN
## 5: DNMT3A                     DECITABINE      DECITABINE
## 6: DNMT3A         inhibitor   DECITABINE   CHEMBL1201129
## 7: DNMT3A         inhibitor  AZACITIDINE      CHEMBL1489

6.8 致癌 pathway

pws = pathways(maf = laml, plotType = 'treemap')
## Summarizing signalling pathways [Sanchez-Vega et al., https://doi.org/10.1016/j.cell.2018.03.035]

plotPathways(maf = laml, pathlist = pws)

6.9 腫瘤異質性分析

6.9.1 需要 mclust 套件

if (!requireNamespace("mclust", quietly = TRUE)) {
    BiocManager::install("mclust", update = FALSE, ask = FALSE)
}

#Heterogeneity in sample TCGA.AB.2972
library("mclust")
## Package 'mclust' version 6.1.2
## Type 'citation("mclust")' for citing this R package in publications.

6.9.2 腫瘤樣本的異質性

tcga.ab.2972.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-2972', vafCol = 'i_TumorVAF_WU')
## Processing TCGA-AB-2972..
print(tcga.ab.2972.het$clusterMeans)
##    Tumor_Sample_Barcode cluster   meanVaf
##                  <fctr>  <char>     <num>
## 1:         TCGA-AB-2972       2 0.4496571
## 2:         TCGA-AB-2972       1 0.2454750
## 3:         TCGA-AB-2972 outlier 0.3695000
#Visualizing results
plotClusters(clusters = tcga.ab.2972.het)

6.9.3 忽略拷貝數變異區域的變異

seg = system.file('extdata', 'TCGA.AB.3009.hg19.seg.txt', package = 'maftools')
tcga.ab.3009.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-3009', segFile = seg, vafCol = 'i_TumorVAF_WU')
## Processing TCGA-AB-3009..
## Removed 1 variants with no copy number data.
##    Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
##         <char>     <char>          <num>        <num>               <fctr>
## 1:        PHF6         23      133551224    133551224         TCGA-AB-3009
##        t_vaf Segment_Start Segment_End Segment_Mean    CN
##        <num>         <int>       <int>        <num> <num>
## 1: 0.9349112            NA          NA           NA    NA
## Copy number altered variants:
##    Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
##         <char>     <char>          <num>        <num>               <fctr>
## 1:     NFKBIL2          8      145668658    145668658         TCGA-AB-3009
## 2:         NF1         17       29562981     29562981         TCGA-AB-3009
## 3:       SUZ12         17       30293198     30293198         TCGA-AB-3009
##        t_vaf Segment_Start Segment_End Segment_Mean       CN    cluster
##        <num>         <int>       <int>        <num>    <num>     <char>
## 1: 0.4415584     145232496   145760746       0.3976 2.634629 CN_altered
## 2: 0.8419000      29054355    30363868      -0.9157 1.060173 CN_altered
## 3: 0.8958333      29054355    30363868      -0.9157 1.060173 CN_altered
#Visualizing results. Highlighting those variants on copynumber altered variants.
plotClusters(clusters = tcga.ab.3009.het, genes = 'CN_altered', showCNvars = TRUE)

6.10 突變特徵

6.10.1 需要 BSgenome.Hsapiens.UCSC.hg19

if (!requireNamespace("BSgenome.Hsapiens.UCSC.hg19", quietly = TRUE)) {
    BiocManager::install("BSgenome.Hsapiens.UCSC.hg19", update = FALSE, ask = FALSE)
}

library("BSgenome.Hsapiens.UCSC.hg19")
## Loading required package: BSgenome
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, saveRDS, setdiff,
##     table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: Biostrings
## Loading required package: XVector
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## Loading required package: BiocIO
## Loading required package: rtracklayer
## 
## Attaching package: 'rtracklayer'
## The following object is masked from 'package:BiocIO':
## 
##     FileForFormat
laml.tnm = trinucleotideMatrix(maf = laml, prefix = 'chr', add = TRUE, ref_genome = "BSgenome.Hsapiens.UCSC.hg19")
## Warning in trinucleotideMatrix(maf = laml, prefix = "chr", add = TRUE, ref_genome = "BSgenome.Hsapiens.UCSC.hg19"): Chromosome names in MAF must match chromosome names in reference genome.
## Ignorinig 101 single nucleotide variants from missing chromosomes chr23
## -Extracting 5' and 3' adjacent bases
## -Extracting +/- 20bp around mutated bases for background C>T estimation
## -Estimating APOBEC enrichment scores
## --Performing one-way Fisher's test for APOBEC enrichment
## ---APOBEC related mutations are enriched in  3.315 % of samples (APOBEC enrichment score > 2 ;  6  of  181  samples)
## -Creating mutation matrix
## --matrix of dimension 188x96

6.10.2 APOBEC富集樣品與未富集樣品之間的差異

plotApobecDiff(tnm = laml.tnm, maf = laml, pVal = 0.2)
## -Processing clinical data
## -Processing clinical data

## $results
## Index: <Hugo_Symbol>
##      Hugo_Symbol Enriched nonEnriched       pval       or     ci.up     ci.low
##           <char>    <num>       <int>      <num>    <num>     <num>      <num>
##   1:        TP53        2          13 0.08175632 5.997646  46.60886 0.49875432
##   2:        TET2        1          16 0.45739351 1.940700  18.98398 0.03882963
##   3:        FLT3        2          45 0.65523131 1.408185  10.21162 0.12341748
##   4:      ADAM11        0           2 1.00000000 0.000000 164.19147 0.00000000
##   5:        APOB        0           2 1.00000000 0.000000 164.19147 0.00000000
##  ---                                                                          
## 132:         WAC        0           2 1.00000000 0.000000 164.19147 0.00000000
## 133:         WT1        0          12 1.00000000 0.000000  12.69086 0.00000000
## 134:      ZBTB33        0           2 1.00000000 0.000000 164.19147 0.00000000
## 135:      ZC3H18        0           2 1.00000000 0.000000 164.19147 0.00000000
## 136:      ZNF687        0           2 1.00000000 0.000000 164.19147 0.00000000
##      adjPval
##        <num>
##   1:       1
##   2:       1
##   3:       1
##   4:       1
##   5:       1
##  ---        
## 132:       1
## 133:       1
## 134:       1
## 135:       1
## 136:       1
## 
## $SampleSummary
## Key: <Cohort>
##         Cohort SampleSize  Mean Median
##         <char>      <num> <num>  <num>
## 1:    Enriched          6 7.167    6.5
## 2: nonEnriched        172 9.715    9.0

6.10.3 簽名分析

packages <- c("NMF", "pheatmap")

for (pkg in packages) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        install.packages(pkg)
    }
}

library('NMF')
## Loading required package: pkgmaker
## Loading required package: registry
## 
## Attaching package: 'pkgmaker'
## The following object is masked from 'package:S4Vectors':
## 
##     new2
## Loading required package: rngtools
## Loading required package: cluster
## NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 27/28
##   To enable shared memory capabilities, try: install.extras('
## NMF
## ')
## 
## Attaching package: 'NMF'
## The following object is masked from 'package:S4Vectors':
## 
##     nrun
laml.sign = estimateSignatures(mat = laml.tnm, nTry = 6, pConstant=0.01)
## -Running NMF for 6 ranks
## Compute NMF rank= 2  ... + measures ... OK
## Compute NMF rank= 3  ... + measures ... OK
## Compute NMF rank= 4  ... + measures ... OK
## Compute NMF rank= 5  ... + measures ... OK
## Compute NMF rank= 6  ... + measures ... OK

## -Finished in 27.7s elapsed (6.164s cpu)
plotCophenetic(res = laml.sign)

laml.sig = extractSignatures(mat = laml.tnm, n = 3, pConstant=0.01)
## -Running NMF for factorization rank: 3
## -Finished in1.275s elapsed (1.174s cpu)
#Compate against original 30 signatures 
laml.og30.cosm = compareSignatures(nmfRes = laml.sig, sig_db = "legacy")
## -Comparing against COSMIC signatures
## ------------------------------------
## --Found Signature_1 most similar to COSMIC_1
##    Aetiology: spontaneous deamination of 5-methylcytosine [cosine-similarity: 0.738]
## --Found Signature_2 most similar to COSMIC_19
##    Aetiology: Unknown [cosine-similarity: 0.665]
## --Found Signature_3 most similar to COSMIC_1
##    Aetiology: spontaneous deamination of 5-methylcytosine [cosine-similarity: 0.682]
## ------------------------------------
#Compate against updated version3 60 signatures 
laml.v3.cosm = compareSignatures(nmfRes = laml.sig, sig_db = "SBS")
## -Comparing against COSMIC signatures
## ------------------------------------
## --Found Signature_1 most similar to SBS15
##    Aetiology: Defective DNA mismatch repair [cosine-similarity: 0.816]
## --Found Signature_2 most similar to SBS5
##    Aetiology: Unknown [cosine-similarity: 0.64]
## --Found Signature_3 most similar to SBS6
##    Aetiology: defective DNA mismatch repair [cosine-similarity: 0.619]
## ------------------------------------
library('pheatmap')
pheatmap::pheatmap(mat = laml.og30.cosm$cosine_similarities, cluster_rows = FALSE, main = "cosine similarity against validated signatures")

maftools::plotSignatures(nmfRes = laml.sig, title_size = 1.2, sig_db = "SBS")

# #  3D bar plot
# library("barplot3d")
# #Visualize first signature
# sig1 = laml.sig$signatures[,1]
# barplot3d::legoplot3d(contextdata = sig1, labels = FALSE, scalexy = 0.01, sixcolors = "sanger", alpha = 0.5)

7 變體註釋

7.1 將 annovar 輸出轉換為 MAF

var.annovar = system.file("extdata", "variants.hg19_multianno.txt", package = "maftools")
var.annovar.maf = annovarToMaf(annovar = var.annovar, Center = 'CSI-NUS', refBuild = 'hg19', 
                               tsbCol = 'Tumor_Sample_Barcode', table = 'ensGene')
## -Reading annovar files
## --Extracting tx, exon, txchange and aa-change
## -Adding Variant_Type
## -Converting Ensemble Gene IDs into HGNC gene symbols
## --Done. Original ensemble gene IDs are preserved under field name ens_id
## Finished in 0.076s elapsed (0.253s cpu)

7.2 將 ICGC Simpale Somatic Mutation Format 轉換為 MAF

#Read sample ICGC data for ESCA
esca.icgc <- system.file("extdata", "simple_somatic_mutation.open.ESCA-CN.sample.tsv.gz", package = "maftools")
esca.maf <- icgcSimpleMutationToMAF(icgc = esca.icgc, addHugoSymbol = TRUE)
## Converting Ensemble Gene IDs into HGNC gene symbols.
## Done! Original ensemble gene IDs are preserved under field name ens_id
## --Removed 427 duplicated variants
#Printing first 16 columns for display convenience.
print(esca.maf[1:5,1:16, with = FALSE])
##    Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
##         <char>          <int> <lgcl>     <char>     <char>          <num>
## 1:  AC005237.4             NA     NA     GRCh37          2      241987787
## 2:  AC061992.1            786     NA     GRCh37         17       76425382
## 3:  AC097467.2             NA     NA     GRCh37          4      156294567
## 4:    ADAMTS12             NA     NA     GRCh37          5       33684019
## 5:  AL589642.1          54801     NA     GRCh37          9       32630154
##    End_Position Strand Variant_Classification Variant_Type Reference_Allele
##           <num> <char>                 <char>       <char>           <char>
## 1:    241987787      +                 Intron          SNP                C
## 2:     76425382      +                3'Flank          SNP                C
## 3:    156294567      +                 Intron          SNP                A
## 4:     33684019      +      Missense_Mutation          SNP                A
## 5:     32630154      +                    RNA          SNP                T
##    Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status
##               <char>            <char>   <lgcl>           <lgcl>
## 1:                 C                 T       NA               NA
## 2:                 C                 T       NA               NA
## 3:                 A                 G       NA               NA
## 4:                 A                 C       NA               NA
## 5:                 T                 C       NA               NA
##    Tumor_Sample_Barcode
##                  <fctr>
## 1:             SA514619
## 2:             SA514619
## 3:             SA514619
## 4:             SA514619
## 5:             SA514619

7.3 準備用於 MutSigCV 分析的 MAF 文件

laml.mutsig.corrected = prepareMutSig(maf = laml)
## Converting gene names for 1 variants from 1 genes
## Key: <MutSig_Synonym>
##    Hugo_Symbol MutSig_Synonym     N
##         <char>         <char> <int>
## 1:    ARHGAP35          GRLF1     1
## Original symbols are preserved under column OG_Hugo_Symbol.
# Converting gene names for 1 variants from 1 genes
#    Hugo_Symbol MutSig_Synonym N
# 1:    ARHGAP35          GRLF1 1
# Original symbols are preserved under column OG_Hugo_Symbol.

8 集合運算

8.1 子集 MAF

#Extract data for samples 'TCGA.AB.3009' and 'TCGA.AB.2933'  (Printing just 5 rows for display convenience)
subsetMaf(maf = laml, tsb = c('TCGA-AB-3009', 'TCGA-AB-2933'), mafObj = FALSE)[1:5]
##    Hugo_Symbol Entrez_Gene_Id           Center NCBI_Build Chromosome
##         <char>          <int>           <char>      <int>     <char>
## 1:      ABCB11           8647 genome.wustl.edu         37          2
## 2:       ACSS3          79611 genome.wustl.edu         37         12
## 3:        ANK3            288 genome.wustl.edu         37         10
## 4:       AP1G2           8906 genome.wustl.edu         37         14
## 5:         ARC          23237 genome.wustl.edu         37          8
##    Start_Position End_Position Strand Variant_Classification Variant_Type
##             <num>        <num> <char>                 <fctr>       <fctr>
## 1:      169780250    169780250      +      Missense_Mutation          SNP
## 2:       81536902     81536902      +      Missense_Mutation          SNP
## 3:       61926581     61926581      +            Splice_Site          SNP
## 4:       24033309     24033309      +      Missense_Mutation          SNP
## 5:      143694874    143694874      +      Missense_Mutation          SNP
##    Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode
##              <char>            <char>            <char>               <fctr>
## 1:                G                 G                 A         TCGA-AB-3009
## 2:                C                 C                 T         TCGA-AB-3009
## 3:                C                 C                 A         TCGA-AB-3009
## 4:                C                 C                 T         TCGA-AB-3009
## 5:                C                 C                 G         TCGA-AB-3009
##    Protein_Change i_TumorVAF_WU i_transcript_name
##            <char>         <num>            <char>
## 1:       p.A1283V      46.97218       NM_003742.2
## 2:        p.A266V      47.32510       NM_024560.2
## 3:                     43.99478       NM_020987.2
## 4:        p.R346Q      47.08000       NM_003917.2
## 5:        p.W253C      42.96435       NM_015193.3
##Same as above but return output as an MAF object (Default behaviour)
subsetMaf(maf = laml, tsb = c('TCGA-AB-3009', 'TCGA-AB-2933'))
## -Processing clinical data
## An object of class  MAF 
##                    ID          summary  Mean Median
##                <char>           <char> <num>  <num>
##  1:        NCBI_Build               37    NA     NA
##  2:            Center genome.wustl.edu    NA     NA
##  3:           Samples                2    NA     NA
##  4:            nGenes               34    NA     NA
##  5:   Frame_Shift_Ins                5   2.5    2.5
##  6:      In_Frame_Ins                1   0.5    0.5
##  7: Missense_Mutation               26  13.0   13.0
##  8: Nonsense_Mutation                2   1.0    1.0
##  9:       Splice_Site                1   0.5    0.5
## 10:             total               35  17.5   17.5

8.2 指定查詢和控制輸出欄位

#Select all Splice_Site mutations from DNMT3A and NPM1
subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), mafObj = FALSE,query = "Variant_Classification == 'Splice_Site'")
##    Hugo_Symbol Entrez_Gene_Id           Center NCBI_Build Chromosome
##         <char>          <int>           <char>      <int>     <char>
## 1:      DNMT3A           1788 genome.wustl.edu         37          2
## 2:      DNMT3A           1788 genome.wustl.edu         37          2
## 3:      DNMT3A           1788 genome.wustl.edu         37          2
## 4:      DNMT3A           1788 genome.wustl.edu         37          2
## 5:      DNMT3A           1788 genome.wustl.edu         37          2
## 6:      DNMT3A           1788 genome.wustl.edu         37          2
##    Start_Position End_Position Strand Variant_Classification Variant_Type
##             <num>        <num> <char>                 <fctr>       <fctr>
## 1:       25459874     25459874      +            Splice_Site          SNP
## 2:       25467208     25467208      +            Splice_Site          SNP
## 3:       25467022     25467022      +            Splice_Site          SNP
## 4:       25459803     25459803      +            Splice_Site          SNP
## 5:       25464576     25464576      +            Splice_Site          SNP
## 6:       25469029     25469029      +            Splice_Site          SNP
##    Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode
##              <char>            <char>            <char>               <fctr>
## 1:                C                 C                 G         TCGA-AB-2818
## 2:                C                 C                 T         TCGA-AB-2822
## 3:                A                 A                 G         TCGA-AB-2891
## 4:                A                 A                 C         TCGA-AB-2898
## 5:                C                 C                 A         TCGA-AB-2934
## 6:                C                 C                 A         TCGA-AB-2968
##    Protein_Change i_TumorVAF_WU i_transcript_name
##            <char>         <num>            <char>
## 1:        p.R803S         36.84       NM_022552.3
## 2:                        62.96       NM_022552.3
## 3:                        34.78       NM_022552.3
## 4:                        46.43       NM_022552.3
## 5:        p.G646V         37.50       NM_022552.3
## 6:        p.E477*         40.00       NM_022552.3
#Same as above but include only 'i_transcript_name' column in the output
subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), mafObj = FALSE, query = "Variant_Classification == 'Splice_Site'", fields = 'i_transcript_name')
##    Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele
##         <char>     <char>          <num>        <num>           <char>
## 1:      DNMT3A          2       25459874     25459874                C
## 2:      DNMT3A          2       25467208     25467208                C
## 3:      DNMT3A          2       25467022     25467022                A
## 4:      DNMT3A          2       25459803     25459803                A
## 5:      DNMT3A          2       25464576     25464576                C
## 6:      DNMT3A          2       25469029     25469029                C
##    Tumor_Seq_Allele2 Variant_Classification Variant_Type Tumor_Sample_Barcode
##               <char>                 <fctr>       <fctr>               <fctr>
## 1:                 G            Splice_Site          SNP         TCGA-AB-2818
## 2:                 T            Splice_Site          SNP         TCGA-AB-2822
## 3:                 G            Splice_Site          SNP         TCGA-AB-2891
## 4:                 C            Splice_Site          SNP         TCGA-AB-2898
## 5:                 A            Splice_Site          SNP         TCGA-AB-2934
## 6:                 A            Splice_Site          SNP         TCGA-AB-2968
##    i_transcript_name
##               <char>
## 1:       NM_022552.3
## 2:       NM_022552.3
## 3:       NM_022552.3
## 4:       NM_022552.3
## 5:       NM_022552.3
## 6:       NM_022552.3

8.3 按臨床資料進行子集劃分

#Select all samples with FAB clasification M4 in clinical data 
laml_m4 = subsetMaf(maf = laml, clinQuery = "FAB_classification %in% 'M4'")
## -subsetting by clinical data..
## --39 samples meet the clinical query
## -Processing clinical data

9 樣品交換識別

#Path to BAM files
bams = c(
  "DBW-40-N.bam",
  "DBW-40-1T.bam",
  "DBW-40-2T.bam",
  "DBW-40-3T.bam",
  "DBW-43-N.bam",
  "DBW-43-1T.bam"
)

res = maftools::sampleSwaps(bams = bams, build = "hg19")
# Fetching readcounts from BAM files..
# Summarizing allele frequncy table..
# Performing pairwise comparison..
# Done!

res$pairwise_comparison

The returned results is a list containing:

  • a matrix of allele frequency table for every genotyped SNP
  • a data.frame of readcounts for ref and alt allele for every genotyped SNP
  • a table with pair-wise concordance among samples
  • a list with potentially matched samples
 res$pairwise_comparison
# X_bam     Y_bam concordant_snps discordant_snps fract_concordant_snps  cor_coef XY_possibly_paired
#  1: DBW-40-1T DBW-40-2T            5488             571             0.9057600 0.9656484                Yes
#  2: DBW-40-1T DBW-40-3T            5793             266             0.9560984 0.9758083                Yes
#  3: DBW-40-1T  DBW-43-N            5534             525             0.9133520 0.9667620                Yes
#  4: DBW-40-2T DBW-40-3T            5853             206             0.9660010 0.9817475                Yes
#  5: DBW-40-2T  DBW-43-N            5131             928             0.8468394 0.9297096                Yes
#  6: DBW-40-3T  DBW-43-N            5334             725             0.8803433 0.9550670                Yes
#  7:  DBW-40-N DBW-43-1T            5709             350             0.9422347 0.9725684                Yes
#  8: DBW-40-1T  DBW-40-N            2829            3230             0.4669087 0.3808831                 No
#  9: DBW-40-1T DBW-43-1T            2796            3263             0.4614623 0.3755364                 No
# 10: DBW-40-2T  DBW-40-N            2760            3299             0.4555207 0.3641647                 No
# 11: DBW-40-2T DBW-43-1T            2736            3323             0.4515597 0.3579747                 No
# 12: DBW-40-3T  DBW-40-N            2775            3284             0.4579964 0.3770581                 No
# 13: DBW-40-3T DBW-43-1T            2753            3306             0.4543654 0.3721022                 No
# 14:  DBW-40-N  DBW-43-N            2965            3094             0.4893547 0.3839140                 No
# 15: DBW-43-1T  DBW-43-N            2876            3183             0.4746658 0.3797829                 No
res$BAM_matches
# [[1]]
# [1] "DBW-40-1T" "DBW-40-2T" "DBW-40-3T" "DBW-43-N" 
# 
# [[2]]
# [1] "DBW-40-2T" "DBW-40-3T" "DBW-43-N" 
# 
# [[3]]
# [1] "DBW-40-3T" "DBW-43-N" 
# 
# [[4]]
# [1] "DBW-40-N"  "DBW-43-1T"

Results can be visualized with the correlation plot.

cor_table = cor(res$AF_table)
pheatmap::pheatmap(cor_table, breaks = seq(0, 1, 0.01))

10 TCGA cohort

10.1 可用 cohort

tcga_avail = tcgaAvailable()
head(tcga_avail, 3)
##    Study_Abbreviation                   Study_Name   MC3
##                <char>                       <char> <int>
## 1:                ACC     Adrenocortical_carcinoma    92
## 2:               BLCA Bladder_Urothelial_Carcinoma   411
## 3:               BRCA    Breast_invasive_carcinoma  1026
##                             Firehose   CCLE
##                               <char> <char>
## 1:  62 [dx.doi.org/10.7908/C1610ZNC]       
## 2: 395 [dx.doi.org/10.7908/C1MW2GGF]       
## 3: 978 [dx.doi.org/10.7908/C1TB167Z]

10.2 載入 TCGA 佇列

# By default MAF from MC3 project will be loaded
laml_mc3 = tcgaLoad(study = "LAML")
## Loading LAML. Please cite: https://doi.org/10.1016/j.cels.2018.03.002 for reference
laml_mc3
## An object of class  MAF 
##                         ID  summary   Mean Median
##                     <char>   <char>  <num>  <num>
##  1:             NCBI_Build   GRCh37     NA     NA
##  2:                 Center MC3_LAML     NA     NA
##  3:                Samples      140     NA     NA
##  4:                 nGenes     4142     NA     NA
##  5:        Frame_Shift_Del      131  0.936    0.0
##  6:        Frame_Shift_Ins      377  2.693    0.0
##  7:           In_Frame_Del        9  0.064    0.0
##  8:           In_Frame_Ins        3  0.021    0.0
##  9:      Missense_Mutation     4137 29.550    7.5
## 10:      Nonsense_Mutation      264  1.886    0.0
## 11:       Nonstop_Mutation       18  0.129    0.0
## 12:            Splice_Site      780  5.571    1.0
## 13: Translation_Start_Site        4  0.029    0.0
## 14:                  total     5723 40.879   13.0
# Change the source to Firehose
laml_fh = tcgaLoad(study = "LAML", source = "Firehose")
## Loading LAML. Please cite: dx.doi.org/10.7908/C1D21X2X for reference
laml_fh
## An object of class  MAF 
##                    ID          summary  Mean Median
##                <char>           <char> <num>  <num>
##  1:        NCBI_Build               37    NA     NA
##  2:            Center genome.wustl.edu    NA     NA
##  3:           Samples              192    NA     NA
##  4:            nGenes             1241    NA     NA
##  5:   Frame_Shift_Del               52 0.271      0
##  6:   Frame_Shift_Ins               91 0.474      0
##  7:      In_Frame_Del               10 0.052      0
##  8:      In_Frame_Ins               42 0.219      0
##  9: Missense_Mutation             1342 6.990      7
## 10: Nonsense_Mutation              103 0.536      0
## 11:       Splice_Site               92 0.479      0
## 12:             total             1732 9.021      9
BiocManager::install(pkgs = "PoisonAlien/TCGAmutations")

11 多重檢測實驗

laml_mae = maf2mae(m = laml)
## Loading required namespace: RaggedExperiment
## Loading required namespace: MultiAssayExperiment
laml_mae
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 2:
##  [1] maf_nonSyn: RaggedExperiment with 1732 rows and 193 columns
##  [2] maf_syn: RaggedExperiment with 475 rows and 193 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

12 套件資訊

sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: x86_64-conda-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /home/benson/miniforge3/envs/maftools/lib/libopenblasp-r0.3.30.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: Asia/Taipei
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] pheatmap_1.0.13                   doParallel_1.0.17                
##  [3] iterators_1.0.14                  foreach_1.5.2                    
##  [5] NMF_0.21.0                        cluster_2.1.8.1                  
##  [7] rngtools_1.5.2                    pkgmaker_0.32.10                 
##  [9] registry_0.5-1                    Biobase_2.66.0                   
## [11] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.74.0                  
## [13] rtracklayer_1.66.0                BiocIO_1.16.0                    
## [15] Biostrings_2.74.0                 XVector_0.46.0                   
## [17] GenomicRanges_1.58.0              GenomeInfoDb_1.42.0              
## [19] IRanges_2.40.0                    S4Vectors_0.44.0                 
## [21] BiocGenerics_0.52.0               mclust_6.1.2                     
## [23] maftools_2.22.0                   DT_0.34.0                        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-9                rlang_1.1.6                
##  [3] magrittr_2.0.4              gridBase_0.4-7             
##  [5] otel_0.2.0                  matrixStats_1.5.0          
##  [7] compiler_4.4.3              vctrs_0.6.5                
##  [9] reshape2_1.4.5              stringr_1.6.0              
## [11] pkgconfig_2.0.3             crayon_1.5.3               
## [13] fastmap_1.2.0               Rsamtools_2.22.0           
## [15] rmarkdown_2.30              UCSC.utils_1.2.0           
## [17] xfun_0.55                   MultiAssayExperiment_1.32.0
## [19] zlibbioc_1.52.0             cachem_1.1.0               
## [21] jsonlite_2.0.0              DelayedArray_0.32.0        
## [23] BiocParallel_1.40.0         R6_2.6.1                   
## [25] bslib_0.9.0                 stringi_1.8.7              
## [27] RColorBrewer_1.1-3          DNAcopy_1.80.0             
## [29] jquerylib_0.1.4             Rcpp_1.1.0                 
## [31] assertthat_0.2.1            SummarizedExperiment_1.36.0
## [33] knitr_1.51                  R.utils_2.13.0             
## [35] Matrix_1.7-4                splines_4.4.3              
## [37] tidyselect_1.2.1            abind_1.4-8                
## [39] yaml_2.3.12                 codetools_0.2-20           
## [41] curl_7.0.0                  lattice_0.22-7             
## [43] tibble_3.3.0                plyr_1.8.9                 
## [45] withr_3.0.2                 S7_0.2.1                   
## [47] evaluate_1.0.5              survival_3.8-3             
## [49] pillar_1.11.1               MatrixGenerics_1.18.0      
## [51] generics_0.1.4              RCurl_1.98-1.17            
## [53] ggplot2_4.0.1               scales_1.4.0               
## [55] xtable_1.8-4                glue_1.8.0                 
## [57] tools_4.4.3                 data.table_1.17.8          
## [59] GenomicAlignments_1.42.0    XML_3.99-0.20              
## [61] grid_4.4.3                  crosstalk_1.2.2            
## [63] colorspace_2.1-2            GenomeInfoDbData_1.2.13    
## [65] RaggedExperiment_1.30.0     restfulr_0.0.16            
## [67] cli_3.6.5                   S4Arrays_1.6.0             
## [69] dplyr_1.1.4                 gtable_0.3.6               
## [71] R.methodsS3_1.8.2           sass_0.4.10                
## [73] digest_0.6.39               SparseArray_1.6.0          
## [75] rjson_0.2.23                htmlwidgets_1.6.4          
## [77] farver_2.1.2                htmltools_0.5.9            
## [79] R.oo_1.27.1                 lifecycle_1.0.4            
## [81] httr_1.4.7